Introduction to R Programming
Welcome
Welcome to the Intro to R Programming Workshop!
Link to Slides: https://favstats.github.io/ds3_r_intro/
Workshop description:
This workshop focuses on the very beginnings of a great journey ahead of you: learning how to use and be comfortable with the statistical programming language R. Together we will explore the basics, from the working environment itself, creating functions for simplifying your tasks, to data management with the tidyverse package. The overarching goal of the workshop is for you to receive the necessary skill set that will enable you to soon embark on your own data science adventures. That being said, the most important aspect of the workshop will be to have fun along the way so that your journey can begin as smoothly and easily as possible.
R as Calculator
At its core R is just a fancy calculator
R can do:
+ addition, - subtraction, * multiplication, / division, ^ exponentiation
Mix operators
## [1] 2
Note: when we put a hash sign (#) in front of our code, R treats the rest of that line as a comment and does not run it.
Comments are just a useful way for us to add some text to our code, for example to explain what we are doing.
## [1] 8
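The code chunks behind the two outputs above are not shown here, so the following is only a small sketch of what calculator-style R code with comments can look like. The numbers are picked for illustration and are not necessarily the ones used in the workshop:

```r
# Everything after a hash sign is a comment and is ignored by R
1 + 1    # addition
5 - 3    # subtraction
2 * 4    # multiplication
8 / 2    # division
2^3      # exponentiation

# Operators can be mixed; parentheses control the order of evaluation
(1 + 3) / 2
```

Running a line prints its result to the console, prefixed with [1].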
An animal data set appears..
But adding up numbers for no reason is no fun.
That’s why we will use a data set about animals to learn some R Basics.
Animal Ageing and Longevity Database (John Snow Labs)
Data on over 4200 animals.
Information on age of maturity, gestation or incubation periods but also longevity (in years).
Animal Lifespan Data
Say you want to know how old an animal is in human years.
We can use the following simple formula to determine that:
\(\frac{\text{Maximum lifespan human}}{\text{Maximum lifespan non-human animal}} = \text{animal to human years conversion ratio}\)
Note: This is just a very rough way to determine the conversion ratio. It is much more complicated in reality.
Why are we using longevity instead of.. let’s say average life expectancy?
According to the Animal Ageing and Longevity Database (John Snow Labs):
“[…] [M]aximum longevity (also called maximum lifespan) […] is the most widely used parameter for comparing the rate of aging between species.”
First some important data points.
Human Longevity
The longest-lived human on record is a woman named Jeanne Calment. The French woman reached the age of 122 (!). She was born in 1875 and lived until 1997.
Animal Longevity Stats
Here is a table of maximum longevity for various animals:
| Animal | Maximum Longevity (in years) |
|---|---|
| Human | 122.5 |
| Domestic dog | 24.0 |
| Domestic cat | 30.0 |
| American alligator | 77.0 |
| Golden hamster | 3.9 |
| King penguin | 26.0 |
| Lion | 27.0 |
| Greenland shark | 392.0 |
| Galapagos tortoise | 177.0 |
| African bush elephant | 65.0 |
| California sea lion | 35.7 |
| Fruit fly | 0.3 |
| House mouse | 4.0 |
| Giraffe | 39.5 |
| Wild boar | 27.0 |
Source: Animal Ageing and Longevity Database.
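As a rough worked example of the conversion formula, here is how the ratio for a domestic dog could be computed with the values from the table. The 15-year-old dog is just an example age:

```r
# Conversion ratio: maximum human lifespan / maximum dog lifespan
122.5 / 24

# Age of a 15 year old dog in "human years"
15 * (122.5 / 24)
```

The first line gives a ratio of about 5.1, so each dog year counts for roughly five human years under this very rough rule.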
R Objects
Now it can be quite tedious to juggle all those numbers around.
Especially if we want to keep reusing numbers we calculated before.
Here to simplify that process are: R objects.
You can think of R objects as saving information, for example simple numbers or just plain text.
In fact it is good to remember that:
\(\color{blue}{\text{"Everything that exists in R is an object."}}\) ~John M. Chambers
By creating an object we save the information to our working environment and we can recall it whenever we want by just running the name of the object.
We create R objects by using the assignment operator:
<-
or
=
The tidyverse style guide recommends using <- for assignment, so that is what we are going to use here.
Here is an example using the small exercise from earlier:
- First, we assign the human max lifespan to a variable called
human_lifespan. Then we assign the maximum lifespan of a dog to dog_lifespan.
If we now run the respective objects we retrieve the saved numbers.
## [1] 122.5
## [1] 24
Now we can perform the same calculation as before but this time using objects!
The variable dogs_to_human now holds the dog to human
years conversion ratio.
## [1] 5.104167
Now we ask again: how old is a 15 year old dog in human years?
## [1] 76.5625
Same result as earlier!
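Put together, the steps described in this section could look roughly like this. The object names human_lifespan, dog_lifespan and dogs_to_human come from the text; the exact code of the original chunks is an assumption:

```r
# Save the maximum lifespans (in years) as objects
human_lifespan <- 122.5
dog_lifespan <- 24

# Running the name of an object prints the saved value
human_lifespan
dog_lifespan

# Dog to human years conversion ratio
dogs_to_human <- human_lifespan / dog_lifespan
dogs_to_human

# How old is a 15 year old dog in human years?
15 * dogs_to_human
```

Because the ratio is stored in dogs_to_human, it can be reused for any other dog age without retyping the lifespans.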
Cat to Human years
- Calculate cats to human years conversion ratio using objects
- How old is a 15 year old cat in “human years”?
Note that object names could be anything here! I chose to go with short but self-explanatory names but I could have chosen to just name them
x, y or z.
A note on naming stuff
\[\color{blue}{\text{"There are only two hard things in Computer Science:}}\] \[\color{blue}{\text{cache invalidation and naming things. ~ Phil Karlton"}}\]
Artist: Allison Horst
There are many approaches to name things in programming.
In this case we will be using a technique called lower snake case, which is also part of the tidyverse style guide.
Use underscores (_) to separate lower case letter words
within a name.
Example:
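A sketch of what such names might look like. The object names are made up for illustration; the values are taken from the longevity table:

```r
# lower snake case: lower case words separated by underscores
king_penguin_lifespan <- 26
california_sea_lion_lifespan <- 35.7

# Other styles you may encounter elsewhere (valid R, but not used here)
kingPenguinLifespan <- 26      # camelCase
king.penguin.lifespan <- 26    # words separated by dots
```

All three styles work in R; sticking to one of them simply makes code easier to read.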
Logical Operators
But wait there are more operators! R knows a ton of them.
Logical operators are used for logical tests which can result in either:
\(\text{TRUE}\) or \(\text{FALSE}\)
(sometimes this is also called a boolean variable)
Let’s first create some more objects to try some logical tests!
lion_lifespan <- 27
mouse_lifespan <- 4
fly_lifespan <- 0.3
boar_lifespan <- 27
alligator_lifespan <- 77
greenland_shark_lifespan <- 392
galapagos_tortoise_lifespan <- 177
Equal
== asks whether two values are the same
or equal
## [1] TRUE
The code above tests the following statement:
> The lifespan of a lion equals the lifespan of a boar.
Since both maximum lifespans are 27 this is of course a
TRUE statement.
Not Equal
!= asks whether two values are not
the same (unequal)
## [1] FALSE
The code above tests the reverse of the above statement:
> The lifespan of a lion does not equal the lifespan of a boar.
Since both maximum lifespans are 27 this is of course a
FALSE statement.
Greater or Smaller
We can also test whether certain values are greater or smaller than others:
The operator > tests greater than, and < tests smaller than.
## [1] TRUE
The code above tests the statement:
> The lifespan of a human is greater than the lifespan of a fly.
Since the maximum human lifespan is 122.5 and a fly does
not live longer than 0.3 years this is of course a
TRUE statement.
## [1] FALSE
The code above tests the statement:
> The lifespan of an alligator is smaller than the lifespan of a mouse.
Since the maximum alligator lifespan is 77 and a mouse
lives for 4 years maximum, this is of course a
FALSE statement.
Also note the following options:
>= greater or equal, <= smaller or equal
## [1] TRUE
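The logical tests described in this section could be written along these lines. The objects are the ones created earlier (repeated so the snippet runs on its own). Which exact test produced the last output above is not shown, so the >= line is an assumption:

```r
human_lifespan <- 122.5
lion_lifespan <- 27
boar_lifespan <- 27
mouse_lifespan <- 4
fly_lifespan <- 0.3
alligator_lifespan <- 77

lion_lifespan == boar_lifespan        # equal: TRUE
lion_lifespan != boar_lifespan        # not equal: FALSE
human_lifespan > fly_lifespan         # greater than: TRUE
alligator_lifespan < mouse_lifespan   # smaller than: FALSE
lion_lifespan >= boar_lifespan        # greater or equal: TRUE
```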
Combine Logical Operators
We can also combine logical tests by testing multiple statements at the same time:
- & stands for “and” (unsurprisingly)
- | stands for “or”
For example both alligator_lifespan and
fly_lifespan have to be greater than
mouse_lifespan for the code below to evaluate as
TRUE.
## [1] FALSE
If we say | (= or) instead, it means either statement
evaluating to TRUE is enough!
## [1] TRUE
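The two combined tests described above can be sketched as follows. The names both_greater and either_greater are only introduced here to hold the results:

```r
alligator_lifespan <- 77
fly_lifespan <- 0.3
mouse_lifespan <- 4

# "and": both tests have to be TRUE
both_greater <- alligator_lifespan > mouse_lifespan & fly_lifespan > mouse_lifespan
both_greater     # FALSE, because the fly does not outlive the mouse

# "or": one TRUE test is enough
either_greater <- alligator_lifespan > mouse_lifespan | fly_lifespan > mouse_lifespan
either_greater   # TRUE, because the alligator outlives the mouse
```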
Vectors
We have now learned about objects. But so far they have only ever held a single numeric value.
R is of course much more powerful than that and objects can hold any number and types of data.
Now we will take a look at vectors, or objects that include more than one value.
You can simply imagine vectors as a list of values. They can consist of numbers but also strings (or: text).
In order to create a vector in R we make use of c()
(which stands for concatenate)
## [1] 1 2 3 4
We can combine the lifespans we assigned into objects earlier:
c(greenland_shark_lifespan, dog_lifespan,
galapagos_tortoise_lifespan,
mouse_lifespan, fly_lifespan,
lion_lifespan, boar_lifespan,
alligator_lifespan, human_lifespan)
## [1] 392.0 24.0 177.0 4.0 0.3 27.0 27.0 77.0 122.5
We can also create a vector of strings by using quotes:
c("greenland_shark", "dog",
"galapagos_tortoise", "mouse",
"fly", "lion", "boar",
"alligator", "human")
## [1] "greenland_shark" "dog" "galapagos_tortoise"
## [4] "mouse" "fly" "lion"
## [7] "boar" "alligator" "human"
Of course we can also assign these vectors to objects.
Let’s say we have a vector animal_lifespans with
different life spans of animals.
animal_lifespans <- c(greenland_shark_lifespan, dog_lifespan,
galapagos_tortoise_lifespan,
mouse_lifespan, fly_lifespan,
lion_lifespan, boar_lifespan,
alligator_lifespan, human_lifespan)
Now if we want to get the different year conversion ratios, we can simply divide the maximum human lifespan by the vector.
## [1] 0.3125000 5.1041667 0.6920904 30.6250000 408.3333333 4.5370370
## [7] 4.5370370 1.5909091 1.0000000
Notice how the operation is performed for each item separately and the result is yet another vector.
We can also use logical operators with vectors:
## [1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Again, notice how the operation is performed for each item separately
and the result is yet another vector, this time consisting of
TRUE and FALSE.
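The two element-wise operations described above could look like this. The lifespans are typed in directly so the snippet runs on its own, and the names conversion_ratios and outlives_human are assumptions made for this sketch:

```r
human_lifespan <- 122.5
animal_lifespans <- c(392, 24, 177, 4, 0.3, 27, 27, 77, 122.5)

# Division is applied to every element of the vector
conversion_ratios <- human_lifespan / animal_lifespans
conversion_ratios

# So is a logical test: which lifespans are greater than the human one?
outlives_human <- animal_lifespans > human_lifespan
outlives_human
```

Both results have nine elements, one for each animal in animal_lifespans.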
Small note on Variable Types
There are four main types of variables:
- logical: Boolean/binary, is either
TRUE or FALSE
## [1] "logical"
- character (or string): simple text, including
symbols and numbers
"text"
## [1] "character"
- numeric: Numbers. Mathematical operators can be used here.
## [1] "numeric"
- factor: characters or strings but ordered in some way
## [1] "factor"
Another important value to know is NA. It stands for
“Not Available” and simply denotes a missing value.
There are also more types, such as dates and datetimes, but we won’t focus on those now.
## [1] NA 1 2 NA
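The outputs in this section match what the base R function class() reports, so the checks could be sketched like this. That class() was the function used is an assumption, and the example values are made up:

```r
class(TRUE)                              # "logical"
class("text")                            # "character"
class(122.5)                             # "numeric"
class(factor(c("low", "high", "low")))   # "factor"

# NA marks missing values inside a vector
values_with_missing <- c(NA, 1, 2, NA)
values_with_missing

# is.na() tests each element for missingness
is.na(values_with_missing)               # TRUE FALSE FALSE TRUE
```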
A logical operator for vectors: %in%
An incredibly useful operator for vectors is
%in%.
The operator checks, for each element of one vector, whether it occurs anywhere in another vector.
This basic usage looks like this:
\(\color{red}{\text{vector1}}\) %in% \(\color{orange}{\text{vector2}}\)
Let’s first create a vector called animals which
includes all the animals in the same order as in animal_lifespans:
animals <- c("greenland_shark", "dog",
"galapagos_tortoise", "mouse",
"fly", "lion", "boar",
"alligator", "human")
Let’s say we want to check whether giraffe,
greenland_shark or lion occur in
animals.
If we use | we would have to write something like
this:
## [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
With %in% we can simply pass a vector like this:
## [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
Where (if at all) does greenland_shark or
giraffe occur within our animals vector?
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
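The comparisons described above could be written as follows. The names with_or and with_in are introduced only for this sketch, and the final which() call is an extra that is not part of the original text:

```r
animals <- c("greenland_shark", "dog", "galapagos_tortoise", "mouse",
             "fly", "lion", "boar", "alligator", "human")

# With |, every animal needs its own comparison
with_or <- animals == "giraffe" | animals == "greenland_shark" | animals == "lion"

# With %in%, we pass all candidates as one vector
with_in <- animals %in% c("giraffe", "greenland_shark", "lion")

with_or
with_in
identical(with_or, with_in)   # TRUE: both approaches give the same answer

# which() turns the TRUE/FALSE vector into positions
which(animals %in% c("greenland_shark", "giraffe"))   # 1
```

Note that giraffe simply never matches; %in% does not complain about values that are absent from animals.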
This might not look too important of an improvement. But what if we had dozens or hundreds of animals to check?
animals == "honey_bee" | animals == "cardiocondyla_obscurior" | animals == "black_garden_ant" | animals == "pheidole_dentata" | animals == "squinting_bush_brown" | animals == "american_lobster" | animals == "firebelly_toad" | animals == "oriental_firebelly_toad" | animals == "yellow_bellied_toad" | animals == "greenland_shark" | animals == "western_toad" | animals == "yosemite_toad" | animals == "great_plains_toad" | animals == "green_toad" | animals == "canadian_toad" | animals == "red_spotted_toad" | animals == "sonoran_green_toad" | animals == "southern_toad" | animals == "veragoa_stubfoot_toad" | animals == "common_european_toad" | animals == "colorado_river_toad" | animals == "kihansi_spray_toad" | animals == "ridge_headed_toad" | animals == "cuban_toad" | animals == "european_green_toad" | animals == "colombian_giant_toad" | animals == "argentine_toad" | animals == "cane_toad" | animals == "eurura_toad" | animals == "common_horned_frog" | animals == "colombian_horned_frog" | animals == "amazonian_horned_frog" | animals == "ornate_horned_frog" | animals == "green_and_black_dart_poison_frog"
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
With %in% it’s much more readable:
animals_to_check <- c("honey_bee", "cardiocondyla_obscurior", "black_garden_ant", "pheidole_dentata", "squinting_bush_brown", "american_lobster", "firebelly_toad", "oriental_firebelly_toad", "yellow_bellied_toad", "greenland_shark", "western_toad", "yosemite_toad", "great_plains_toad", "green_toad", "canadian_toad", "red_spotted_toad", "sonoran_green_toad", "southern_toad", "veragoa_stubfoot_toad", "common_european_toad", "colorado_river_toad", "kihansi_spray_toad", "ridge_headed_toad", "cuban_toad", "european_green_toad","colombian_giant_toad", "argentine_toad", "cane_toad", "eurura_toad", "common_horned_frog", "colombian_horned_frog", "amazonian_horned_frog", "ornate_horned_frog")
animals %in% animals_to_check
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Indexing
Indexing is done via square brackets [] for when you
want to know a specific value within your object.
\[\color{red}{\text{vector}}[\color{orange}{\text{elements}}]\]
Extracting the first element of a vector:
## [1] 392
## [1] "greenland_shark"
Extracting the fifth element of a vector:
## [1] 0.3
## [1] "fly"
## [1] "greenland_shark" "galapagos_tortoise" "fly"
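The indexing calls behind these outputs are not shown, but they could look like the following sketch, using the animal_lifespans and animals vectors from earlier (repeated here so the snippet runs on its own). Note that R starts counting at 1:

```r
animal_lifespans <- c(392, 24, 177, 4, 0.3, 27, 27, 77, 122.5)
animals <- c("greenland_shark", "dog", "galapagos_tortoise", "mouse",
             "fly", "lion", "boar", "alligator", "human")

animal_lifespans[1]   # 392
animals[1]            # "greenland_shark"

animal_lifespans[5]   # 0.3
animals[5]            # "fly"

# Several elements at once: index with a vector of positions
animals[c(1, 3, 5)]   # "greenland_shark" "galapagos_tortoise" "fly"
```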
Indexing with logical tests
You can also index using logical tests.
So if an expression evaluates to TRUE it will
keep that element and when it evaluates to
FALSE it will remove the element.
\[\color{red}{\text{vector}}[\color{orange}{\text{vector of TRUE/FALSE of same length}}]\]
Let’s first take a look at a logical test that extracts all animals that have greater lifespans than humans:
## [1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Now we can use square brackets to only keep those animals that have greater lifespans than humans.
## [1] "greenland_shark" "galapagos_tortoise"
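As a sketch, the logical test and the indexing step can be written separately or combined in one line. The name longer_than_human is an assumption made for this example:

```r
human_lifespan <- 122.5
animal_lifespans <- c(392, 24, 177, 4, 0.3, 27, 27, 77, 122.5)
animals <- c("greenland_shark", "dog", "galapagos_tortoise", "mouse",
             "fly", "lion", "boar", "alligator", "human")

# Step 1: the logical test gives one TRUE/FALSE per animal
longer_than_human <- animal_lifespans > human_lifespan
longer_than_human

# Step 2: keep only the names where the test is TRUE
animals[longer_than_human]

# Both steps in one line
animals[animal_lifespans > human_lifespan]
```

This works because animals and animal_lifespans have the same length and the same order.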
Functions
Artist: Allison Horst
\(\color{blue}{\text{"Everything that happens in R is a function."}}\) ~John M. Chambers
You can think of functions as little machines that (in most cases) process some input and create an output.
Input is everything that goes into a function:
arguments you can think of as (pre-determined) input types like a lever or numpad.
values you can think of as the various settings that the levers or numpads can have.
\[\text{function_name}(\color{orange}{\text{argument}}=\color{lightblue}{\text{value}})\]
The Star Producer
Let’s consider the following function (that does not exist
unfortunately): A star_producer!
This little machine creates tiny hand-drawn stars depending on some input. It takes two arguments:
- how_many tells the machine how many stars to produce
- type tells the machine what the stars should look like (in this case the machine only supports "squiggly" stars but it could be upgraded in the future when we learn how to create our own functions later on)
Note: just like levers or numpads you also have to be careful what value type an argument expects. You can’t type numbers into a lever and a numpad doesn’t do much when you pull it (other than break). Similarly, a certain argument might require you to input a numeric value, or more complex R objects.
Get ?help
But how do you know which function takes what kind of values?
You can find the documentation of a function when you type in
?function_name. This should tell you about the arguments,
how to use the function and what to expect from its output.
Example function: seq
Let’s try this with a function that is included in every fresh R
installation: seq.
From the description we can learn that this function is used to
“[g]enerate regular sequences”.
Its first three arguments are this:
- from, to: the starting and (maximal) end values of the sequence.
- by: number: increment of the sequence.
So we specify from when to where our sequence should go and by which number it should increment.
Let’s first take a look at this within our machine allegory.
If we wanted to create a vector from 1 to 10 that increments by 1, we can simply specify the following input values for the arguments:
- from: 1
- to: 10
- by: 1
This is what that looks like in code:
## [1] 1 2 3 4 5 6 7 8 9 10
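A sketch of the call described above, with the result stored in an object named one_to_ten (a name chosen for this example), plus a second call with a different increment to show what by does:

```r
# From 1 to 10 in steps of 1
one_to_ten <- seq(from = 1, to = 10, by = 1)
one_to_ten

# Same start and end, but in steps of 3: the sequence stops at or below `to`
seq(from = 1, to = 10, by = 3)   # 1 4 7 10
```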
Passing Values
Now there are two ways to pass values to functions in R:
- Pass by argument names (we already did this above!)
- Pass by argument position
In the former case, we specifically mention which arguments we want to pass our values to.
For that, it doesn’t matter in which order we pass our arguments.
So this is perfectly valid code that produces the same result as above even if it seems less intuitive:
## [1] 1 2 3 4 5 6 7 8 9 10
But: coders are lazy.
There is no need to always specify which argument you mean exactly when you can just match by position.
So our sequence example could just as well look like this:
## [1] 1 2 3 4 5 6 7 8 9 10
And it works because the documentation tells us that the first three
arguments are from, to, and
by.
In the future you will often see people leave out the argument names completely, so it’s good to get used to it.
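The three ways of calling seq discussed in this section can be compared directly. The object names are made up for this sketch:

```r
# Pass by argument names
by_name <- seq(from = 1, to = 10, by = 1)

# Pass by argument names, in a different order
shuffled <- seq(by = 1, to = 10, from = 1)

# Pass by position: first = from, second = to, third = by
by_position <- seq(1, 10, 1)

identical(by_name, shuffled)      # TRUE
identical(by_name, by_position)   # TRUE

# With positions the order matters: this counts down instead
seq(10, 1, -1)
```

Matching by position saves typing, but swapping two values silently changes the meaning, which is why named arguments are often the safer choice for less familiar functions.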
More examples: Mean and Median
In fact, most of the time functions have such intuitive structures that we often wouldn’t need to even look up the documentation (even though it’s advisable to do).
An easy function to use is mean, which simply calculates
the average of a numeric vector.
Let’s try this with the animal_lifespans we created
earlier.
## [1] 94.53333
Quite a high mean value!
## [1] 27
Well when we take the median instead of the mean we can see that this is due to high outliers (the median is of course more robust to extreme values).
There are of course many more functions in R and we will get to learn some of them during this workshop.
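For reference, the two calls discussed above could look like this, with the lifespans typed in directly so the snippet runs on its own:

```r
animal_lifespans <- c(392, 24, 177, 4, 0.3, 27, 27, 77, 122.5)

# The mean is pulled upwards by the shark and the tortoise
mean(animal_lifespans)     # 94.53333

# The median is the middle value of the sorted lifespans
median(animal_lifespans)   # 27
sort(animal_lifespans)     # the fifth of nine sorted values is 27
```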
Creating our own functions
We can create our own function using the call:
function().
We encode what is supposed to happen within curly brackets
{}.
Here is the anatomy of a function:
\(\color{purple}{\text{name}}\) <- \(\text{function}(\color{orange}{\text{argument}})\){
\(\color{green}{\text{# function body}}\)
\(\color{lightblue}{\text{value}}\) <- \(\color{orange}{\text{argument}}\)
\(\text{return(}\color{lightblue}{\text{value}}\text{)}\)
}
- \(\color{purple}{\text{Function name}}\):
- An identifier by which the function is called
- \(\color{orange}{\text{Argument(s)}}\):
- Contains a list of values passed to the function
- Can also contain a default value like this:
argument = 1
- \(\color{green}{\text{Function body}}\):
- This is executed each time the function is called
- \(\color{lightblue}{\text{Return value}}\):
- Ends the function call and sends the value back to wherever the function was called from
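To make the anatomy concrete, here is a small sketch with all four parts, including a default value. The function add_years and its arguments are hypothetical and only serve as an illustration:

```r
# Function name: add_years; arguments: age, and years with a default of 1
add_years <- function(age, years = 1) {
  # Function body: executed each time the function is called
  new_age <- age + years
  # Return value: handed back to the caller
  return(new_age)
}

add_years(age = 5)               # uses the default years = 1, gives 6
add_years(age = 5, years = 10)   # overrides the default, gives 15
```

Objects created inside the body, such as new_age, only exist while the function runs; only the returned value is available afterwards.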
Square functions
Let’s define a function which squares numeric values.
square <- function(number_to_be_squared) {
squared_number <- number_to_be_squared^2
return(squared_number)
}
## [1] 16
Tip: In RStudio we can just type fun and press Enter once the autocomplete popup appears, and RStudio will automatically generate a template for a function.
Dog to Human years function
Let’s create a function that converts dog years into
human years. We call the function dog_to_human_years.
dog_to_human_years <- function(animal_years){
human_lifespan <- 122.5
dog_lifespan <- 24
ratio <- human_lifespan/dog_lifespan
human_years <- animal_years*ratio
return(human_years)
}
## [1] 40.83333
Works perfectly!
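The call that produced the output above is not shown. A call such as the one below would give that result; the input of 8 dog years is inferred from the output, so treat it as an assumption. Because R works element by element, the same function also accepts a whole vector of ages:

```r
dog_to_human_years <- function(animal_years){
  human_lifespan <- 122.5
  dog_lifespan <- 24
  ratio <- human_lifespan/dog_lifespan
  human_years <- animal_years*ratio
  return(human_years)
}

# A single age
dog_to_human_years(8)             # 40.83333

# Several ages at once: one result per element
dog_to_human_years(c(1, 8, 15))
```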
A quick note on errors and debugging
Artist: Allison Horst
So you encountered an error:
## Error: object of type 'closure' is not subsettable
Steps to take:
Try to understand what it says.
Google (or other search engines)
- Search for the error
- Search for what you were trying to do (add R or rstats)
If the error occurs in a function check whether you passed an object type that isn’t expected by it.
- Checking the documentation can help here! Type:
?function_name
Ask for help! Create a reproducible example (reprex) and post to online communities
- Stack Overflow
- Twitter (use #rstats)
- RStudio community (forum)
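To get a feeling for the error shown at the start of this section, here is one common way to trigger it, sketched with tryCatch() so that the script keeps running. In R, functions are "closures", and the error appears when we try to index a function as if it were a vector. The exact wording of the message may differ between R versions:

```r
# mean is a function, not a vector, so indexing it with [ ] fails
error_message <- tryCatch(
  mean[1],
  error = function(e) conditionMessage(e)
)

# The error text is now stored as a plain string we can read or search for
error_message
```

This often happens when an object was never created and its name happens to be the name of an existing function.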
Artist: Allison Horst
Exercises I
Now it’s your turn to write some R code!
Please open 02_exercises_I.Rmd
for some fun exercises.